Alignment Newsletter Podcast

Alignment Newsletter #171: Disagreements between alignment "optimists" and "pessimists"

Update: 2022-01-23

Description

Recorded by Robert Miles: http://robertskmiles.com

More information about the newsletter here: https://rohinshah.com/alignment-newsletter/

YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg

 

HIGHLIGHTS

Alignment difficulty (Richard Ngo and Eliezer Yudkowsky) (summarized by Rohin): Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His argument in this dialogue is roughly as follows:

1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to take some pivotal act before some careless actor destroys the world.

2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are “consequentialist” in that the simplest way to get them (and thus, the one we will first build) is if you are forecasting what might happen, thinking about the expected consequences, considering possible obstacles, searching for routes around the obstacles, etc. If you don’t do this sort of reasoning, your plan goes off the rails very quickly -- it is highly unlikely to lead to high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans. (A toy sketch of this planning style appears after this list.)

3. We’re producing AI systems by selecting for systems that can do impressive stuff, which will eventually produce AI systems that can accomplish high-impact plans using a general underlying “consequentialist”-style reasoning process (because that’s the only way to keep doing more impressive stuff). However, this selection process does not constrain the goals towards which those plans are aimed. In addition, most goals seem to have convergent instrumental subgoals like survival and power-seeking that would lead to extinction. This suggests that we should expect an existential catastrophe by default.

4. None of the methods people have suggested for avoiding this outcome seem like they actually avert this story.
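To make the “consequentialist” loop in point 2 concrete, here is a minimal sketch (mine, not from the dialogue) of a planner that forecasts consequences with a toy world model, evaluates the predicted outcomes, and searches over candidate action sequences. Every name, the dynamics, and the scoring function are hypothetical stand-ins chosen only to show the forecast-evaluate-search shape, not to suggest that real planners look anything like this.

```python
# Toy sketch only (not from the dialogue): a tiny "consequentialist" planner.
# It forecasts the consequences of candidate action sequences with a world
# model and keeps the sequence whose predicted outcome it scores highest.
# The dynamics, scoring, and all names here are hypothetical stand-ins.
from itertools import product


def forecast(state: int, action: int) -> int:
    """Hypothetical world model: predict the next state after an action."""
    return state + action


def score(state: int, goal: int) -> int:
    """Hypothetical preferences: predicted states closer to the goal are better."""
    return -abs(goal - state)


def plan(start: int, goal: int, actions=(-1, 0, 1), horizon=4):
    """Enumerate action sequences, simulate each one, and return the best."""
    best_seq, best_score = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        state = start
        for action in seq:          # forecast what would happen...
            state = forecast(state, action)
        value = score(state, goal)  # ...and evaluate the predicted outcome
        if value > best_score:
            best_seq, best_score = seq, value
    return best_seq


if __name__ == "__main__":
    # Prints a sequence whose forecasted end state reaches the goal, e.g. (0, 1, 1, 1).
    print(plan(start=0, goal=3))
```

The claim in points 2 and 3 is that something with this shape (forecast, evaluate, search around obstacles), scaled far past any toy, is the simplest way to get high-impact plans, and that the search is dangerous whenever it is aimed at the wrong goal.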

Richard responds to this with a few distinct points:

1. It might be possible to build AI systems that are not of world-destroying intelligence and agency, and that humans could use to save the world. For example, we could make AI systems that do better alignment research. Such AI systems do not seem to require the property from point (3) above of making long-term plans in the real world, and so could plausibly be safe.

2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing those plans.

3. It seems possible to create consequentialist systems with constraints upon their reasoning that lead to reduced risk.

4. It also seems possible to create systems with the primary aim of producing plans with certain properties (that aren't just about outcomes in the world) -- think for example of corrigibility (AN #35) or deference to a human user.

5. (Richard is also more bullish on coordinating not to use powerful and/or risky AI systems, though the debate did not discuss this much.)

Eliezer’s responses:

1. AI systems that help with alignment research to such a degree that it actually makes a difference are almost certainly already dangerous.

2. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous.

3 and 4. It is certainly possible to do such things; the space of minds that could be designed is very large. However, it is difficult to do such things, as they tend to make consequentialist reasoning weaker, and on our current trajectory the first AGI that we build will probably not look like that.

This post has also been summarized by others here, though with different emphases than in my summary.

 

Rohin's opinion: I first want to note my violent agreement with the notion that a major scary thing is “consequentialist reasoning”, and that high-impact plans require such reasoning, and that we will end up building AI systems that produce high-impact plans. Nonetheless, I am still optimistic about AI safety relative to Eliezer, which I suspect comes down to three main disagreements:

1. There are many approaches that don’t solve the problem, but do increase the level of intelligence required before the problem leads to extinction. Examples include Richard’s points 1-4 above. For example, if we build a system that states plans without executing them, then for the plans to cause extinction they need to be complicated enough that the humans executing those plans don’t realize that they are leading to an outcome that was not what they wanted. It seems non-trivially probable to me that such approaches are sufficient to prevent extinction up to the level of AI intelligence needed before we can execute a pivotal act.

2. The consequentialist reasoning is only scary to the extent that it is “aimed” at a bad goal. It seems non-trivially probable to me that it will be “aimed” at a goal sufficiently good to not lead to existential catastrophe, without putting in much alignment effort.

3. I do expect some coordination to not do the most risky things.

I wish the debate had focused more on the claim that non-scary AI can’t e.g. do better alignment research, as it seems like a major crux. (For example, I think that sort of intuition drives my disagreement #1.) I expect AI progress looks a lot like “the heuristics get less and less shallow in a gradual / smooth / continuous manner” which eventually leads to the sorts of plans Eliezer calls “consequentialist”, whereas I think Eliezer expects a sharper qualitative change between “lots of heuristics” and that-which-implements-consequentialist-planning.

 
